options(knitr.table.format = "html")
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
library(ggplot2)
library(tidyverse)
library(Biobase)
library(lavaan)
library(HIMA)
library(bama)
library(LUCIDus)
library(reshape2)
library(networkD3)
library(sm)
library(plotly)
library(janitor)
library(dplyr)
# read in data
work.dir = "/cloud/project/Labs/Mediation"
dat = readRDS(file.path(work.dir, "HELIX_data.rds"))
#loading plotting function for LUCID (highly customizable functions for complex LUCID model that are not built in the LUCIDus package)
source(file.path(work.dir, "/plotting_func/plot_lucid_in_parallel.R"))
source(file.path(work.dir, "plotting_func/plot_lucid_in_serial.R"))
source(file.path(work.dir, "plotting_func/plot_omics_profiles_general.R"))

6 Latent Unknown Clustering Integrating omics Data (LUCID)

LUCID allows user to incorporate and analyze multiple omics layers at the same time. LUCID is an integrative analysis, which jointly estimates latent clusters based on multiple omics layers and associates latent clusters with outcome of interest. In the lecture, we have covered the theoretical part of the LUCID model. We will go through the details of its application in a real world study.

6.1 Background

Baillie-Hamilton. J Altern Complement Med. 2002
Baillie-Hamilton. J Altern Complement Med. 2002

In 2002, physician Paula Baillie-Hamilton conducted a study revealing that the obesity epidemic correlated with the increased production of synthetic chemicals. Among them, organochlorines (OCs) constitute a major class of synthetic chemicals with suspected obesogenic properties. The effects of OC are biomagnified through the food chain, and especially affect the health outcome of pregnant woman and children.

OCs include synthetic chemicals that were widely used as pesticides and in industrial processes throughout most of the 20th century. The use of these chemicals are discontinued in the United States and Europe, however, because of their persistence in the environment, the general population is still exposed to these substances at low doses, and adverse health outcomes related to background population levels of exposure are a major concern. OCs are ubiquitous and persist in the environment, accumulate in high concentrations in fatty tissues, and are biomagnified through the food chain.

Pregnancy and childhood constitute periods of high vulnerability to chemical toxic effects, as it is when rapid tissue development occurs while there is incomplete development or function of protective mechanisms, such as xenobiotic metabolism and immune function

Research Question: 1. Explore the association between exposure to OCs and BMI (measured as a standardized BMI z-score) 2. Estimate the latent clusters in children characterized by exposome (exposure to OCs) and multiple omics layers (proteomics, serum metabolites and urinary metabolites); construct exposure and omics profiles for children with high risk of obesity

We use the LUCID model to analyze this research question, which can be expressed as the DAG below. Research Question

6.1 Prepare data for LUCID and conduct preliminary screening

First, let’s prepare data for LUCID. The LUCID model takes matrix, or matrix-like data (such as data.frame) as input. If input is vector, it will be transformed into a matrix automatically. Besides, since we incorporate multiple layers of omics data which are measured at different scales, it’s recommended to standardize multiple omics data to make sure they are at the same scale.

#===================================#
## 1. prepare data for LUCID model ##
#===================================#
exposure = dat[, 2:19]
omics = scale(as.matrix(dat[, 27:ncol(dat)])) # scale the omics
outcome = dat[, "hs_zbmi_who"]
cova_names = c("h_mbmi_None", "e3_sex_None", "h_age_None", "h_cohort", "h_edumc_None")
M_names = colnames(omics)
expo_names = colnames(exposure)
covariate = model.matrix(~., dat[, cova_names])[, -1]

Next, we conduct a preliminary screening to decrease the number of omics variables to a moderate number. Here we use the idea of pairwise association between exposure and omics layer, and between omics layer and the outcome. This is similar to the “meet in the middle” approach we conducted in the previous lab. Preliminary Screening ### Balanced Omics Feature Selection Using Modified M-i-M (FDR = 0.1)

We applied a two-step modified Meet-in-the-Middle (M-i-M) procedure to identify omics features that are both associated with exposures and predictive of the outcome. To ensure equal representation across omics layers, which was encouraged for early integration of multi-omics data. we selected exactly 10 features per layer, prioritizing exposure-responsive features but allowing fallback to top outcome-associated features when necessary.

  1. Step 1 (X → M | Y):
    For each omics layer, we test the association between each exposure and each omics feature, adjusting for the outcome and covariates. Features significantly associated with at least one exposure (FDR < 0.1) are flagged as exposure-responsive.

  2. Step 2 (M → Y):
    All features in each layer are tested for association with the outcome. Features are ranked by FDR-adjusted p-values. The top 10 per layer are selected, giving priority to exposure-responsive features but supplemented with top Y-associated features if needed.

This approach ensures 30 final features (10 per layer), balancing statistical relevance and biological plausibility across omics types.

#===========================#
## Group omics by layer    ##
#===========================#
omics <- as.data.frame(omics)
S_metab <- omics[, grep("^serum", names(omics), value = TRUE)]
U_metab <- omics[, grep("^urine", names(omics), value = TRUE)]
Proteins <- omics[, setdiff(names(omics), c(names(S_metab), names(U_metab)))]

omics_layers <- list(
  serum = as.matrix(S_metab),
  urine = as.matrix(U_metab),
  protein = as.matrix(Proteins)
)

#===========================#
## Layer-wise selection    ##
#===========================#
selected_M_by_layer <- list()

for (layer in names(omics_layers)) {
  layer_data <- omics_layers[[layer]]
  layer_M_names <- colnames(layer_data)
  
  ## Step 1: Exposure → M | Y
  res_M_E_list <- lapply(layer_M_names, function(m) {
    lapply(expo_names, function(e) {
      f <- as.formula(paste(m, "~", e, "+ hs_zbmi_who +", paste(cova_names, collapse = "+")))
      coefs <- summary(lm(f, data = dat))$coefficients
      data.frame(M = m, E = e, est = coefs[2, 1], p = coefs[2, 4])
    })
  })
  res_M_E <- do.call(rbind, unlist(res_M_E_list, recursive = FALSE))
  res_M_E$p.adj <- p.adjust(res_M_E$p, method = "fdr")
  
  responsive_M <- unique(res_M_E$M[res_M_E$p.adj < 0.1])
  
  ## Step 2: M → Y (on all features in the layer)
  res_M_Y_list <- lapply(layer_M_names, function(m) {
    f <- as.formula(paste("hs_zbmi_who ~", m, "+", paste(cova_names, collapse = "+")))
    coefs <- summary(lm(f, data = dat))$coefficients
    data.frame(M = m, p = coefs[2, 4])
  })
  res_M_Y <- do.call(rbind, res_M_Y_list)
  res_M_Y$p_fdr <- p.adjust(res_M_Y$p, method = "fdr")
  
  ## Prioritize responsive M, fill to 10
  res_M_Y <- res_M_Y %>%
    arrange(p_fdr) %>%
    mutate(priority = ifelse(M %in% responsive_M, 1, 2)) %>%
    arrange(priority, p_fdr)
  
  top10 <- head(res_M_Y$M, 10)
  selected_M_by_layer[[layer]] <- top10
}

# Final selected omics features across all layers
selected_M <- unlist(selected_M_by_layer)
cat("Total selected features:", length(selected_M), "\n")  # Should be 30
## Total selected features: 30

The preliminary screening selected 30 omics out of 257 omcis variables.

6.2 Conduct variable selection

Integrated Selection After preliminary screening, 101 proteins or metabolites are selected. The number is still moderately large and we hope to further decrease the number of omics variables to facilitate interpretation of the LUCID model. LUCID comes with an integrated variable selection based on \(L_1\) penalties. Next, we show how to tune penalty terms for LUCID and conduct integrated variable selection.

#===================================#
## 3. conduct variable selection   ##
#===================================#


# step 1: tune the model parameters
#=============================================================
# this chunk of codes is time-consuming, you can try it later;
# we use the results of this tuning process directly
#=============================================================
# set.seed(123)
# tune_lucid <- lucid(G = exposure,
#                     Z = omics_selected,
#                     Y = outcome,
#                     CoY = covariate,
#                     Rho_G = c(0, 0.05, 0.1),
#                     Rho_Z_Mu = c(0.05, 0.1, 0.15),
#                     Rho_Z_Cov = c(0.05, 0.1, 0.15),
#                     K = c(2:4),
#                     lucid_model = "early",
#                     init_omic.data.model = NULL)

# # check the optimal tuning parameters
# tune_lucid$tune_list

# step 2: integrated variable selections
#==========================================================================
#  optimal model: K = 4, Rho_Z_Cov = 0.05, Rho_Z_Mu = 0.05, Rho_G = 0.0
#==========================================================================
set.seed(123)
# Fit the LUCID model with early integration, using selected exposures and omics features
fit_try1 <- estimate_lucid(
  G = exposure,     # G: matrix of selected exposures (e.g., environmental variables)
  Z = omics[, selected_M],                # Z: matrix of selected omics features (e.g., metabolites)
  Y = outcome,                            # Y: outcome variable (e.g., BMI z-score)
  CoY = covariate,                        # CoY: covariates to adjust for in the Y model (e.g., age, sex, etc.)
  useY = FALSE,                            # Whether to use Y during model fitting (TRUE = supervised clustering)
  K = 4,                                  # Number of latent clusters to estimate
  family = "normal",                      # Distribution family for outcome Y ("normal" for continuous outcomes)
  Rho_Z_Cov = 0.1,                        # Penalty parameter for the covariance matrix of Z (encourages sparsity)
  Rho_Z_Mu = 1,                           # Penalty parameter for the cluster-specific mean of Z (for feature selection)
  Rho_G = 0,                              # Penalty parameter for the effect of G on cluster membership (0 = no penalty)
  lucid_model = "early",                 # Integration strategy: "early" = joint modeling of G, Z, and Y
  init_omic.data.model = NULL,           # Optional: specify mclust model for Z; NULL lets mclust choose the best model,
)
#check how many omics are selected out of 30
sum(fit_try1$select$selectZ)
#88 were selected

# refit the model with the selected features, we don't need penalty for this

set.seed(123)
fit_try2 = estimate_lucid(G = exposure,
                     Z = omics[, selected_M][, fit_try1$select$selectZ],
                     Y = outcome, 
                     CoY = covariate,
                     useY = FALSE, 
                     K = 4, 
                     family = "normal", 
                     lucid_model = "early",
                     init_omic.data.model = fit_try1$init_omic.data.model)

summary summarizes LUCID model and print out a table explaining associations among each components of LUCID. This summary table is very useful when a few number of environmental/genetic exposures and omic variables are included. Otherwise, it’s better to interpret LUCID model through a Sankey diagram, as we’ll show later.

summary(fit_try1)
## ----------Summary of the LUCID Early Integration model---------- 
##  
## K =  4 , log likelihood = -42645.59 , BIC =  112043.1 
##  
## (1) Y (continuous outcome): effect size of Y for each latent cluster (and effect of covariates if included) 
##                        Gamma
## cluster1         0.000000000
## cluster2        -0.692657964
## cluster3        -0.288355028
## cluster4         1.090351914
## h_mbmi_None      0.047112347
## e3_sex_Nonemale  0.231709786
## h_age_None       0.006758062
## h_cohort2       -0.523571477
## h_cohort3       -0.092012729
## h_cohort4        0.245864981
## h_cohort5        0.153513780
## h_cohort6        0.355271991
## h_edumc_None2    0.042926879
## h_edumc_None3    0.061160477
## 
## (2) Z: mean of omics data for each latent cluster 
##                 mu_cluster1 mu_cluster2  mu_cluster3 mu_cluster4
## serum_metab_95   0.00000000 -0.17215825 -0.286693679  0.42441303
## serum_metab_161  0.00000000 -0.17400752  0.017736264  0.35992051
## serum_metab_59   0.00000000 -0.02911857 -0.468116132  0.17055350
## serum_metab_85   0.00000000 -0.08925315 -0.478412885  0.29852582
## serum_metab_79   0.00000000 -0.03513718 -0.334050061  0.14758341
## serum_metab_102  0.00000000 -0.10441161 -0.147147154  0.25532372
## serum_metab_149  0.00000000 -0.07755897  0.181420559  0.11058572
## serum_metab_175  0.00000000 -0.04583465 -0.428024892  0.19690549
## serum_metab_103  0.00000000  0.00000000 -0.083816123  0.02037608
## serum_metab_119  0.00000000  0.03989559 -0.103377162 -0.04205404
## urine_metab_10   0.00000000 -0.03566755 -0.551701052  0.20214352
## urine_metab_6    0.03301145 -0.17996757 -0.479637434  0.48750798
## urine_metab_4    0.00000000 -0.09057700 -0.248641231  0.26406628
## urine_metab_13   0.00000000  0.11494312 -0.243420266 -0.17586179
## urine_metab_27   0.00000000  0.09737545  0.298586011 -0.28088029
## urine_metab_5    0.00000000 -0.17646288  0.657768106  0.20916555
## urine_metab_9    0.00000000 -0.02099490  0.150843619  0.00771194
## urine_metab_1    0.00000000 -0.06072911  0.213301717  0.06988198
## urine_metab_25   0.00000000  0.16541754 -0.036096912 -0.33258456
## urine_metab_31   0.00000000  0.07809542 -0.158557016 -0.13257482
## IL1beta          0.00000000 -0.49278647 -0.250859735  1.06690972
## IL6              0.00000000 -0.49580727  0.102448195  0.98939427
## Leptin           0.00000000 -0.32487488 -0.194574150  0.71602721
## IL10             0.00000000  0.11090282 -0.098998745 -0.20929624
## IP10             0.00000000  0.00000000  0.298630185 -0.07396379
## PAI1             0.00000000  0.03491634  0.015160507 -0.08682869
## INSULIN          0.00000000 -0.22220002 -0.226551099  0.50524929
## CRP              0.00000000 -0.21404398  0.004530880  0.43596969
## BAFF             0.00000000 -0.14754302  0.003426692  0.29424738
## HGF              0.00000000 -0.17066722  0.218637472  0.29904748
## 
## (3) E: odds ratio of being assigned to each latent cluster for each exposure 
##                                       beta           OR
## hs_dde_cadj_Log2.cluster2        1.8725909 6.505129e+00
## hs_dde_madj_Log2.cluster2       13.5935965 8.009842e+05
## hs_ddt_cadj_Log2.cluster2       -0.6103345 5.431692e-01
## hs_ddt_madj_Log2.cluster2        5.1910559 1.796582e+02
## hs_hcb_cadj_Log2.cluster2       -3.6801939 2.521808e-02
## hs_hcb_madj_Log2.cluster2        2.7955499 1.637163e+01
## hs_pcb118_cadj_Log2.cluster2     5.6700425 2.900469e+02
## hs_pcb118_madj_Log2.cluster2   -32.8708219 5.301313e-15
## hs_pcb138_cadj_Log2.cluster2    -0.3662958 6.932977e-01
## hs_pcb138_madj_Log2.cluster2    15.1025473 3.622038e+06
## hs_pcb153_cadj_Log2.cluster2    44.8250988 2.932872e+19
## hs_pcb153_madj_Log2.cluster2    -2.0265771 1.317858e-01
## hs_pcb170_cadj_Log2.cluster2     6.6531501 7.752225e+02
## hs_pcb170_madj_Log2.cluster2    20.1741172 5.774412e+08
## hs_pcb180_cadj_Log2.cluster2     7.5071144 1.820951e+03
## hs_pcb180_madj_Log2.cluster2    16.9756217 2.357322e+07
## hs_sumPCBs5_cadj_Log2.cluster2 -15.9016284 1.241682e-07
## hs_sumPCBs5_madj_Log2.cluster2  -8.6729947 1.711458e-04
## hs_dde_cadj_Log2.cluster3        1.7556918 5.787450e+00
## hs_dde_madj_Log2.cluster3       13.7352136 9.228423e+05
## hs_ddt_cadj_Log2.cluster3       -0.6546072 5.196462e-01
## hs_ddt_madj_Log2.cluster3        5.1713289 1.761488e+02
## hs_hcb_cadj_Log2.cluster3       -3.5838240 2.776931e-02
## hs_hcb_madj_Log2.cluster3        3.2466113 2.570309e+01
## hs_pcb118_cadj_Log2.cluster3     6.2212723 5.033432e+02
## hs_pcb118_madj_Log2.cluster3   -32.7298465 6.103912e-15
## hs_pcb138_cadj_Log2.cluster3    -0.1693120 8.442455e-01
## hs_pcb138_madj_Log2.cluster3    15.0201013 3.335394e+06
## hs_pcb153_cadj_Log2.cluster3    44.1175379 1.445451e+19
## hs_pcb153_madj_Log2.cluster3    -2.3810098 9.245717e-02
## hs_pcb170_cadj_Log2.cluster3     6.5885999 7.267626e+02
## hs_pcb170_madj_Log2.cluster3    19.7234115 3.679330e+08
## hs_pcb180_cadj_Log2.cluster3     7.6231962 2.045088e+03
## hs_pcb180_madj_Log2.cluster3    17.2171659 3.001374e+07
## hs_sumPCBs5_cadj_Log2.cluster3 -16.0336818 1.088079e-07
## hs_sumPCBs5_madj_Log2.cluster3  -8.3034871 2.476517e-04
## hs_dde_cadj_Log2.cluster4        1.7711077 5.877360e+00
## hs_dde_madj_Log2.cluster4       13.7252267 9.136718e+05
## hs_ddt_cadj_Log2.cluster4       -0.6148026 5.407476e-01
## hs_ddt_madj_Log2.cluster4        5.3517591 2.109791e+02
## hs_hcb_cadj_Log2.cluster4       -4.9394434 7.158582e-03
## hs_hcb_madj_Log2.cluster4        3.4996845 3.310501e+01
## hs_pcb118_cadj_Log2.cluster4     6.2967587 5.428096e+02
## hs_pcb118_madj_Log2.cluster4   -32.5629291 7.212728e-15
## hs_pcb138_cadj_Log2.cluster4     0.4445751 1.559827e+00
## hs_pcb138_madj_Log2.cluster4    15.0617613 3.477281e+06
## hs_pcb153_cadj_Log2.cluster4    44.1034375 1.425212e+19
## hs_pcb153_madj_Log2.cluster4    -1.7224810 1.786224e-01
## hs_pcb170_cadj_Log2.cluster4     6.5351135 6.889120e+02
## hs_pcb170_madj_Log2.cluster4    19.6213705 3.322408e+08
## hs_pcb180_cadj_Log2.cluster4     7.4160129 1.662392e+03
## hs_pcb180_madj_Log2.cluster4    17.1476956 2.799946e+07
## hs_sumPCBs5_cadj_Log2.cluster4 -16.4007073 7.538125e-08
## hs_sumPCBs5_madj_Log2.cluster4  -8.5630579 1.910342e-04

6.3 Prediction of LUCID model

After fitting the LUCID model with selected features, we can predict the cluster assignment for each observaton by calling the predict_lucid function.

# prediction of LUCID model
pred_fit_try2 = predict_lucid(model = fit_try2, 
                              G = exposure,
                              Z = omics[, selected_M][, fit_try1$select$selectZ],
                              CoY = covariate,
                              lucid_model = "early")
# prediction on latent cluster
table(pred_fit_try2$pred.x)
## 
##   0   1   2   3 
## 320  69 645 118

6.4 Visualize LUCID model

LUCID uses a Sankey diagram to visualize the complex associations among different components. In the Sankey diagram of LUCID, each node represents a variable in our dataset (for example, exposure to different organochlorines, various omics feature) and each link represents a statistical association. The color of the link indicates the direction of association: by default, light blue represents positive association while dark blue represents negative association. The width of the link indicates the magnitude of association: the wider a link is, the stronger the statisitcal association is.

#===================================#
## 3. visualize the LUCID model    ##
#===================================#

# Since there are still a high number of omics features (88) included in the final LUCID model, we can only visualize top 10 omics features based on the absolute values of differnce across the 2 clusters

# 1. use internal plot function
plot(fit_try2)
# 2. personalize color
# user can personalize colors of node and link
plot(fit_try2,
     G_color = "red",
     X_color = "blue",
     Z_color = "green",
     Y_color = "black",
     pos_link_color = "orange",
     neg_link_color = "gray")

6.5 Create omics and exposure profiles based on LUCID model

Distribtuion of BMI across 4 clusters

Outcome effect
Outcome effect
#========================================================================#
## 4. Distribution of zBMI score for each cluster predicted by LUCID    ##
#========================================================================#
Y_fit_try2 = as.data.frame(cbind(cluster = as.factor(pred_fit_try2$pred.x), dat[, "hs_zbmi_who"]))
Y_fit_try2 = melt(Y_fit_try2, id.vars = "cluster")
ggplot(Y_fit_try2, aes(x = as.factor(cluster), 
                       y = value, 
                       goup = as.factor(cluster))) +
  geom_boxplot() +
  xlab("cluster") +
  ylab("z-bmi")

To interpret the clustering results of LUCID, we can create box plot of BMI z-score for each latent luster. From the boxplot, we observe that children belonging to cluster 4 are at high risk of obesity while the first 3 clsuters tend to have a similar averaged BMI z-score.

Omics profiles for each identified cluster

Outcome profiles
Outcome profiles
#=============================================================#
## 5. Omics profiles for  each cluster predicted by LUCID    ##
#=============================================================#
M_mean = as.data.frame(fit_try2$res_Mu)
M_mean$cluster = as.factor(1:4)
M_mean_melt = melt(M_mean, id.vars = "cluster")
# add color label for omics layer
M_mean_melt$color_lable = rep("1", nrow(M_mean_melt))
M_mean_melt[grep("serum", M_mean_melt$variable), "color_lable"] = "2"
M_mean_melt[grep("urine", M_mean_melt$variable), "color_lable"] = "3"
ggplot(M_mean_melt, aes(fill = color_lable, y = value, x = variable)) +
  geom_bar(position="dodge", stat="identity") +
  ggtitle("Omics profiles for 4 latent clusters") +
  facet_wrap(~cluster) +
  facet_grid(rows = vars(cluster)) +
  theme(legend.position="none") +
  xlab("") +
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle = 45, vjust = 1,
                                   hjust = 1))

This figure, the y-axis represents the estimated mean expression level for each metabolite. We could observe distinguished pattern of omic profiles for each subgroup. We want to specifically highlight 3 proteins: insulin, IL1-beta and IL-6. Latent cluster 4, the high risk subgroup, is particularly characterized with elevated levels of IL-1beta and IL-6, two key markers of systemic inflammation in the human body and higher levels of insulin, which is known to closely associate with obesity and underlying metabolic dysfunction.

Exposure profiels for each identified cluster

Exposure profiles
Exposure profiles
#================================================================#
## 6. Exposure profiles for  each cluster predicted by LUCID    ##
#================================================================#
E_mean = as.data.frame(fit_try2$res_Beta[, -1])
E_mean$cluster = as.factor(1:4)
E_mean_melt = melt(E_mean, id.vars = "cluster")
ggplot(E_mean_melt, aes(fill = variable, y = value, x = variable)) +
  geom_bar(position="dodge", stat="identity") +
  ggtitle("Exposure profiles for 4 latent clusters") +
  facet_wrap(~cluster) +
  facet_grid(rows = vars(cluster)) +
  geom_hline(data = data.frame(yint=0, cluster ="1"), aes(yintercept = yint), linetype = "dashed", color = "red") +
  theme(legend.position="none") +
  xlab("") +
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle = 45, vjust = 1,
                                   hjust = 1))

This figure displays the log odds ratios of exposures across the four latent clusters. Each bar represents the strength and direction of association between a specific exposure and cluster membership, adjusted for covariates.

Notably, hs_hcb_madj_Log2 (maternal HCB) shows a strong gradient across clusters, increasing from cluster 1 to cluster 4, suggesting it plays a key role in distinguishing exposure profiles. This trend supports a potential obesogenic effect of maternal HCB, supporting previous HELIX findings (Vrijheid et al., EHP 2020).

Other exposures also contribute to the cluster structure, including: Cord HCB (hs_hcb_cadj_Log2), which decreases from cluster 2 to 4 linking higher early-life HCB exposure to lower BMI z-scores; PCBs 118, 138, and 153 in both maternal and cord plasma; And DDE/DDT, which also show moderate directional shifts across clusters.

These patterns underscore the complex mixture structure of environmental exposures in this cohort, which LUCID early integration model helped to reveal. Rather than being driven by a single compound, the latent clusters reflect combinatorial signatures, with maternal HCB as a particular contributor. This highlights the value of mixture modeling approaches for identifying exposure profiles that may influence downstream health outcomes like BMI.

Exploration

In the analysis above, we conducted early integration for multi-omics data by concatenating three layers (serum metabolites, urine metabolites and proteins). What will results differ if we conduct late integration (analyzing each omic layer separately then combine the results?) You can pick the selected exposure or any exposure of interest.

  1. Exposure: column 2 to 19, dat[, 2:19]
  2. Covariate: column 20 to 24, dat[, 20:24]
  3. Outcome: column 26, dat[, 26]
  4. Serum metabolites: column 27 to 203, dat[, 27:203]
  5. Urine metabolites: column 204 to 247, dat[, 204:247]
  6. Proteins: column 248 to 283, dat[, 248:283]

Intermediate and Late Integration

DAGs of LUCID Models
DAGs of LUCID Models

Three DAGs represent how three different types of the LUCID model integrate ge- netic/environmental exposures (G), other multi-omics data (Z), and the phenotype trait (Y). (a) LUCID early integration; (b) LUCID in parallel; (c) LUCID in serial.

In the previous sections, we focus on using the early integration strategy by concatenating each omic layer one by one and inputting a single data matrix into the LUCID model for estimation. However, researchers may be interested in modeling the correlation structure of each omic layer independently to investigate how multi-omics data may act in parallel with an outcome. Recent advancement of the LUCIDus package allows us to do it.

#================================================================#
## 7. Group different omic layers for selected omics features   ##
#================================================================#
omics_selected = as.data.frame(omics[,selected_M])
# Extract serum features
S_metab <- omics_selected[, grep("^serum", names(omics_selected), value = TRUE)]

# Extract urine features
U_metab <- omics_selected[, grep("^urine", names(omics_selected), value = TRUE)]

# Extract Protein features
Proteins <- omics_selected[,names(omics_selected)[!grepl("^serum|^urine", names(omics_selected))]]

#Put them in a list
omics_selected_list <- list(as.matrix(S_metab), as.matrix(U_metab), as.matrix(Proteins))
#================================================================#
## 8. Intermediate Integration: LUCID in Parallel    ##
#================================================================#

#Here, since we have a small number of features in each layer, we don't use regularity
fit_try3 = estimate_lucid(G = exposure, 
                     Z = omics_selected_list, 
                     Y = outcome, 
                     CoY = covariate,
                     useY = FALSE,  
                     K = c(2,2,2), #2 latent cluster for each layer
                     family = "normal",
                     lucid_model = "parallel")
#get the summary table of the model
#print.sumlucid(summary_lucid(fit_try3))

#add a prefic for proteins features as they don't have a common pattern
fit_try3$var.names$Znames[[3]] <- paste("protein", fit_try3$var.names$Znames[[3]], sep = "_")

#visualize LUCID in parallel
#Sankey Diagram
plot_lucid_in_parallel_plotly(fit_try3, 
                              sankey_colors = sankey_colors_parallel,
                              text_size = 10,
                              n_z_ftrs_to_plot = c(10,10,10))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#omics profile
plot_omics_profiles(fit_try3, "intermediate",pattern_list = list("serum","urine","protein"), omics_list = list("serum","urine","protein"))

In a more late integration framework, LUCID can also be extended to incorporate multiple latent variables in a serial fashion if researchers believe that given an exposure, multi-omics data act serially through a multistep process towards the outcome. This framework can accommodate the following situations: (1) longitudinal measurements on the same multi-omics data type and (2) biological relationship of multi-omics data. The estimation of latent clusters for each omic layer can be formulated as an unsupervised LUCID early integration or LUCID in parallel sub-model.

#================================================================#
## 9. Late Integration: LUCID in Serial   ##
#================================================================#
#Here, since we have a small number of features in each layer, we don't use regularity
fit_try4 = estimate_lucid(G = exposure, 
                     Z = omics_selected_list, 
                     Y = outcome, 
                     CoY = covariate,
                     useY = FALSE,  
                     K = c(2,2,2), #2 latent cluster for each layer
                     family = "normal",
                     lucid_model = "serial")
## ...................................................................................................................................................................................................
#get the summary table of the model
#print.sumlucid(summary_lucid(fit_try4))

#add a prefic for proteins features as they don't have a common pattern

fit_try4$var.names$Znames[[3]] <- paste("protein", fit_try4$var.names$Znames[[3]], sep = "_")

#visualize LUCID in serial
sankey_in_serial(fit_try4,
                 color_pal_sankey_serial,
                 text_size = 10)
#omics profile
plot_omics_profiles(fit_try4, "late",pattern_list = list("serum","urine","protein"), omics_list = list("serum","urine","protein"))

Reference

  1. Peng, C., Wang, J., Asante, I., Louie, S., Jin, R., Chatzi, L., Casey, G., Thomas, D.C., and Conti, D.V. (2019). A Latent Unknown Clustering Integrating Multi-Omics Data (LUCID) with Phenotypic Traits. Bioinformatics.
  2. Zhao, Y., Jia, Q., Goodrich, J., Darst, B., and Conti, D.V. (2024). An Extension of Latent Unknown Clustering Integrating Multi-Omics Data (LUCID) Incorporating Incomplete Omics Data. Bioinformatics Advances. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11368387/
  3. Source codes for LUCIDus are hosted on https://github.com/USCbiostats/LUCIDus. It’s also availabe on CRAN